Learning R with data

I’m going to do a very simple analysis of Baltimore crime to introduce some R basics. We’ll use data downloaded from Baltimore City’s awesome open data site (this was downloaded a couple of years ago, so if you download it now you will get different results).

Getting started

In RStudio, let’s open a new R script. We will use it to try out commands, and later to write functions that perform a few steps at a time. The first command we’ll try reads data from a comma-separated text file:

read.csv(file="BPD_Arrests.csv", header=TRUE, stringsAsFactors=FALSE)

So that command read arrest records from the file, and printed them out. This type of command is a function call. read.csv is a function that performs some number of operations (in this case read data from a file). The function needs to know what to operate on, that’s what arguments are for. The file argument says where the file is located, the header argument says “yes, the first line in the file contains column names”, and the stringsAsFactors argument does some esoteric work that we will ignore for the moment.
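If you don't have the arrests file at hand, here is a small sketch of the same kind of call that you can run anywhere. The file contents and the names csv_path and toy_tab are made up for illustration, and the `<-` symbol used to name things is explained in the next section.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("arrest,age,district",
             "1,23,SOUTHERN",
             "2,37,WESTERN"), csv_path)

# file: where the file is located
# header: TRUE means the first line contains column names
# stringsAsFactors: FALSE keeps text columns as plain strings
toy_tab <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)

# print the table we just read
toy_tab
```

The first line of the file became the column names, and the two remaining lines became the two rows of the table.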

Variables

Now this was a useful command, but we couldn’t do much with the result because R did not remember it. To remember things we need to give the result a name:

arrest_tab <- read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)

This command says “assign to the name arrest_tab the result of calling read.csv with those arguments”. We call these names variables because they can vary, i.e., they can change value.

# assign value to variable
my_city <- "Mexico D.F."

# print the current value of this variable
my_city
## [1] "Mexico D.F."
# change the value of the variable
my_city <- "Monterrey"

# print the new value
my_city
## [1] "Monterrey"

You should think of variables as boxes where we store things we want to remember. Every time we use the variable in a command, it will take whatever is in the box. Every time we assign a new value to the variable, we change the contents of the box.

Exploring data

Now that we have a way of remembering the data we read from the file, we can start to take a look at what it is. First of all, what kind of value is in the arrest_tab box?

class(arrest_tab)
## [1] "data.frame"

The data.frame type is one of the most important types in R. It corresponds to a data table (like a sheet in a spreadsheet). Each row corresponds to an observation, and columns to characteristics of each observation. Let’s see how many rows and columns we have:

# how many rows and columns does the data.frame have?
dim(arrest_tab)
## [1] 104528     15
# if we only wanted the number of rows
nrow(arrest_tab)
## [1] 104528

What are the column names?

# what are the columns
colnames(arrest_tab)
##  [1] "arrest"            "age"               "sex"              
##  [4] "race"              "arrestDate"        "arrestTime"       
##  [7] "arrestLocation"    "incidentOffense"   "incidentLocation" 
## [10] "charge"            "chargeDescription" "district"         
## [13] "post"              "neighborhood"      "Location.1"

Indexing data

One of the most fundamental operations in data analysis is selecting data to analyze. For a data.frame we use indexing to do this.

# This command selects the entry in row 1 column 5
arrest_tab[1,5]
## [1] "01/01/2011"

To select more than one value we can use a slice.

# This command selects rows 1 through 10 of column 5
arrest_tab[1:10,5]
##  [1] "01/01/2011" "01/01/2011" "01/01/2011" "01/01/2011" "01/01/2011"
##  [6] "01/01/2011" "01/01/2011" "01/01/2011" "01/01/2011" "01/01/2011"

We can use the same notation in columns:

# Columns 1 through 5 of row 5
arrest_tab[5,1:5]
##     arrest age sex race arrestDate
## 5 11126968  33   B    M 01/01/2011

What is the difference between selecting multiple rows of a single column, vs. selecting multiple columns of a single row?
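One way to investigate this question is to ask R what kind of value each selection returns. This is a sketch on a small made-up table (toy, col_part and row_part are names chosen for illustration), using the class function.

```r
# a small made-up table with two columns
toy <- data.frame(age = c(23, 37, 46),
                  sex = c("M", "M", "F"),
                  stringsAsFactors = FALSE)

# several rows of a single column
col_part <- toy[1:2, 1]

# several columns of a single row
row_part <- toy[1, 1:2]

# what kind of value did we get in each case?
class(col_part)
class(row_part)
```

Values within a column all share one type, so R can return them as a plain vector. Values in a row may have different types, so the row stays a data.frame.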

We can also combine slices for rows and columns

# What does this select?
arrest_tab[1:10,1:5]
##      arrest age sex race arrestDate
## 1  11126858  23   B    M 01/01/2011
## 2  11127013  37   B    M 01/01/2011
## 3  11126887  46   B    M 01/01/2011
## 4  11126873  50   B    M 01/01/2011
## 5  11126968  33   B    M 01/01/2011
## 6  11127041  41   B    M 01/01/2011
## 7  11126932  29   B    M 01/01/2011
## 8  11126940  20   W    M 01/01/2011
## 9  11127051  24   B    M 01/01/2011
## 10 11127018  53   B    M 01/01/2011

If we want to select non-consecutive rows or columns, instead of a slice we use a vector of indices. To construct a vector in R, we use the c function (c stands for “concatenate”)

# This selects rows 2,4,7,10 and the first five columns
arrest_tab[c(2,4,7,10),1:5]
##      arrest age sex race arrestDate
## 2  11127013  37   B    M 01/01/2011
## 4  11126873  50   B    M 01/01/2011
## 7  11126932  29   B    M 01/01/2011
## 10 11127018  53   B    M 01/01/2011

In fact the slice notation we used previously is shorthand for “create a vector of consecutive indices”. If we want to select all rows or columns, we just don’t pass any indexing vector

arrest_tab[c(2,4,7,10),]
##      arrest age sex race arrestDate arrestTime    arrestLocation
## 2  11127013  37   B    M 01/01/2011      00:01  2000 Wilkens Ave
## 4  11126873  50   B    M 01/01/2011      00:04 2100 Ashburton St
## 7  11126932  29   B    M 01/01/2011      00:05   800 N Monroe St
## 10 11127018  53   B    M 01/01/2011      00:15 3300 Woodland Ave
##    incidentOffense         incidentLocation charge
## 2         79-Other Wilkens Av & S Payson St 1 1425
## 4         79-Other        2100 Ashburton St 1 1106
## 7         79-Other          800 N Monroe St 1 5212
## 10 54-Armed Person         3300 Woodland Av 1 1425
##                              chargeDescription     district post
## 2  Reckless Endangerment || Hand Gun Violation     SOUTHERN  934
## 4        Reg Firearm:Illegal Possession || Hgv      WESTERN  735
## 7       Handgun On Person || Handgun Violation      WESTERN  724
## 10                Reckless Endangerment || Hgv NORTHWESTERN  614
##              neighborhood                      Location.1
## 2        Carrollton Ridge (39.2814026274, -76.6483635135)
## 4  Panway/Braddish Avenue (39.3117196723, -76.6623546313)
## 7       Midtown-Edmondson (39.2979815407, -76.6475113571)
## 10   Central Park Heights (39.3436773374, -76.6727297618)
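To convince ourselves that a slice really is just a vector of consecutive indices, we can compare both ways of indexing on a small example. This sketch uses the built-in letters vector; the variable names are made up.

```r
# the first ten letters of the alphabet
my_letters <- letters[1:10]

# the slice 2:5 is itself a vector of indices
2:5

# select with a slice, and with an explicit vector of indices
with_slice <- my_letters[2:5]
with_vector <- my_letters[c(2, 3, 4, 5)]

# both select the same entries
identical(with_slice, with_vector)
```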

Challenges:

  1. Select rows 20 through 30, columns 5 through 10.
  2. Select rows 10,20,30,40, all columns.
  3. The function seq is a very powerful way of defining indices, see ?seq to get more information. Use seq to select the first ten rows and columns (equivalent to arrest_tab[1:10,1:10] without using slice notation).
  4. Use seq to select the odd numbered rows (i.e., 1,3,5,7…). You can use the by argument in seq to do this. You can also use the nrow function we saw before.

Since data.frames have column names, we can index them using specific column names.

# select the first ten entries in the age column
arrest_tab[1:10,"age"]
##  [1] 23 37 46 50 33 41 29 20 24 53

You can also select more than one column by giving a vector of column names

arrest_tab[1:10,c("age","sex","race")]
##    age sex race
## 1   23   B    M
## 2   37   B    M
## 3   46   B    M
## 4   50   B    M
## 5   33   B    M
## 6   41   B    M
## 7   29   B    M
## 8   20   W    M
## 9   24   B    M
## 10  53   B    M

Now, to confuse you a little bit: R has a special notation to index a single column from a data.frame using the $ symbol. This selects the first ten entries in the age column as well:

# select the first 10 entries in the age column
arrest_tab$age[1:10]
##  [1] 23 37 46 50 33 41 29 20 24 53

Challenge:

  1. Write three different ways to select entries 20 through 30 of the sex column.

The last way to index a data.frame that we’ll see is through “logical” indices. To select rows we build a vector as long as the number of rows, and each entry in the vector states “yes, I want this entry” (TRUE), or “no, I don’t want this entry” (FALSE). Let’s look at a smaller example:

# make a vector containing the first 10 letters in the alphabet 
my_vector <- letters[1:10]
my_vector
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

Let’s use a logical index vector to select the first five entries

# first five entries with a logical index
my_vector[c(TRUE,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE)]
## [1] "a" "b" "c" "d" "e"
# and now the odd entries
my_vector[c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)]
## [1] "a" "c" "e" "g" "i"

Now this is very tedious of course, but it opens up a great way of interacting with data. For example, this command returns a logical vector stating which rows have age less than 21

# make a logical vector for rows with age less than 21
arrest_tab$age < 21

We can now use that to index, for example, the sex column only for rows with age less than 21

arrest_tab$sex[arrest_tab$age < 21]

# equivalently
arrest_tab[arrest_tab$age < 21, "sex"]
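Here is the same idea on two short made-up vectors, so you can see every entry of the logical vector. The names ages, sexes and is_under_21 are only for illustration.

```r
# two made-up vectors of the same length
ages <- c(19, 34, 20, 45, 17)
sexes <- c("M", "F", "F", "M", "M")

# one TRUE or FALSE per entry: is this age less than 21?
is_under_21 <- ages < 21
is_under_21

# keep only the entries of sexes where the condition is TRUE
sexes[is_under_21]

# TRUE counts as 1 and FALSE as 0, so sum counts the matching entries
sum(is_under_21)
```

The last line is a handy trick: summing a logical vector tells you how many entries satisfy the condition.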

Challenges

  1. Select the first 5 columns for rows where neighborhood equals “Mount Washington”. Notice that the symbol == checks for equality.

More Exploration

Now that we know how to select portions of a data.frame let’s dig a little deeper. The summary function in R provides a lot of information about data.frames and vectors. This summarizes the age column

# summary of values in the age column
summary(arrest_tab$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    23.0    30.0    33.2    43.0    87.0

Here are summaries of the sex and race columns.

summary(arrest_tab$sex)
##    Length     Class      Mode 
##    104528 character character
summary(arrest_tab$race)
##    Length     Class      Mode 
##    104528 character character

Not very informative, but let’s use a seemingly magical feature of R. It understands that if you have a vector of strings, sometimes you want to treat it in a special way when you are doing data analysis. We use the factor function to tell R that entries in a string vector can take one of a given number of possible values, and that we want to analyze characteristics of those values.

# summarize sex and race as factors
summary(factor(arrest_tab$sex))
##           A     B     H     I     U     W 
##     2   242 87268     1   218  1749 15048
summary(factor(arrest_tab$race))
##           F     M 
##     2 19431 85095
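To see what factor does, here is a sketch on a short made-up vector of codes. Note that one entry is an empty string; an empty string shows up as a level with a blank name, which is probably what the unlabeled first column in the output above corresponds to.

```r
# a made-up vector of codes, including one empty string
codes <- c("B", "W", "B", "", "A", "B")

# tell R these strings take one of a few possible values
code_factor <- factor(codes)

# the possible values (called levels) that R found
levels(code_factor)

# how many times each level appears
summary(code_factor)
```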

We’re already capable of doing another essential aspect of data analysis: cleaning! The sex and race columns are mislabeled. Let’s fix those labels

# let's see those names again
colnames(arrest_tab)
##  [1] "arrest"            "age"               "sex"              
##  [4] "race"              "arrestDate"        "arrestTime"       
##  [7] "arrestLocation"    "incidentOffense"   "incidentLocation" 
## [10] "charge"            "chargeDescription" "district"         
## [13] "post"              "neighborhood"      "Location.1"
# ok so sex is the third column and race the fourth column, let's rename them
colnames(arrest_tab)[c(3,4)] <- c("race", "sex")

# let's see how we did
summary(factor(arrest_tab$race))
##           A     B     H     I     U     W 
##     2   242 87268     1   218  1749 15048
summary(factor(arrest_tab$sex))
##           F     M 
##     2 19431 85095

That’s better!
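Renaming by position works, but it depends on us counting columns correctly. As an alternative sketch (not what we did above), we can look up the positions by name with the match function. The table toy here is made up and has the same kind of mix-up.

```r
# a made-up table where the sex and race labels are swapped
toy <- data.frame(arrest = 1:2,
                  sex = c("B", "W"),
                  race = c("M", "F"),
                  stringsAsFactors = FALSE)

# find the positions of the two mislabeled columns by name
swap_pos <- match(c("sex", "race"), colnames(toy))
swap_pos

# give those positions the corrected names
colnames(toy)[swap_pos] <- c("race", "sex")

# check the result
colnames(toy)
toy$race
```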

Let’s see what else we can learn about this dataset!

# what is the average arrest age?
mean(arrest_tab$age)
## [1] 33.19639
# and the median?
median(arrest_tab$age)
## [1] 30
# the range of arrest ages
range(arrest_tab$age)
## [1]  0 87
# the minimum arrest age?
min(arrest_tab$age)
## [1] 0
# how many arrests per sex
table(arrest_tab$sex)
## 
##           F     M 
##     2 19431 85095
# what about the mean arrest age in Mount Washington
mean(arrest_tab[arrest_tab$neighborhood == "Mount Washington", "age"])
## [1] 31.10345

Now this is good, but it would be great to see what this data looks like. This is where visualization comes in, which we will discuss more later in the course.

Let’s start with a boxplot of arrest ages

boxplot(arrest_tab$age, ylab="Arrest Age")

Maybe there is some difference between sexes? Let’s use another magical R feature: the formula notation x~y usually means “perform this command on x, but condition on y”. We’ll talk a lot about “conditioning” in this course.

boxplot(arrest_tab$age~arrest_tab$sex, ylab="Arrest Age", xlab="Sex")
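The same x~y notation works in other functions too. As a small sketch on made-up data, the aggregate function computes a summary of one column conditioned on another; the table toy and the name mean_age_by_sex are only for illustration.

```r
# a small made-up table
toy <- data.frame(age = c(20, 30, 40, 50),
                  sex = c("F", "F", "M", "M"),
                  stringsAsFactors = FALSE)

# mean of age, conditioned on sex
# the data argument says where to find the age and sex columns
mean_age_by_sex <- aggregate(age ~ sex, data = toy, FUN = mean)
mean_age_by_sex
```

boxplot also accepts a data argument, so boxplot(age ~ sex, data = toy) would draw the corresponding plot without repeating the table name.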

Challenge:

  1. Make a boxplot of arrest age conditioned on race.

Another useful plot is a barplot of number of arrests by race

barplot(table(arrest_tab$race), xlab="Race", ylab="Number of arrests")

Functions

Ok, so those are three things we would like to do in our analysis: boxplots conditioned on sex and race, and a barplot of number of arrests by race. An interesting analysis would be to see what these plots look like for different neighborhoods. Let’s take “Mount Washington” again

# first, let's select age, race and sex for Mount Washington
mt_washington_arrests <- arrest_tab[arrest_tab$neighborhood == "Mount Washington", c("age", "race", "sex")]

# now let's make those plots
boxplot(mt_washington_arrests$age~mt_washington_arrests$sex, xlab="Sex", ylab="Arrest Age", main="Mount Washington")

boxplot(mt_washington_arrests$age~mt_washington_arrests$race, xlab="Race", ylab="Arrest Age", main="Mount Washington")

barplot(table(mt_washington_arrests$race), xlab="Race", ylab="Number of Arrests", main="Mount Washington")

This is great, but it’s going to get tedious to repeat all of this for other neighborhoods. When there are a number of operations with different subsets of data we can group them together into functions. This function will make these three plots for a given neighborhood passed as the neighborhood argument

# make analysis plots for given neighborhood
analyze_neighborhood <- function(neighborhood) {
  # subset arrest table to records from given neighborhood
  # only will use age, sex and race columns
  arrest_subset <- arrest_tab[arrest_tab$neighborhood == neighborhood, c("age", "sex", "race")]
  
  # make a boxplot of age conditioned on sex
  boxplot(arrest_subset$age~arrest_subset$sex, xlab="Sex", ylab="Arrest Age", main=neighborhood)
  
  # make a boxplot of age conditioned on race
  boxplot(arrest_subset$age~arrest_subset$race, xlab="Race", ylab="Arrest Age", main=neighborhood)
  
  # a barplot of number of arrests per race
  barplot(table(arrest_subset$race), xlab="Race", ylab="Number of Arrests", main=neighborhood)
}

When you execute this command nothing is printed in the R console, since what we did was assign the name analyze_neighborhood to the function we wrote. Now that we have a name for it we can call it on different neighborhoods

analyze_neighborhood("Roland Park")

analyze_neighborhood("Hampden")

This is much cleaner than writing the same four commands over and over for each neighborhood. Also, if you wanted to change something in the plots (say, the x-axis label) you only need to do it once in the function definition. This is also less prone to errors.

Writing functions is an important aspect of reproducible, robust, and clean data analysis.
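Functions can also hand back a value, not just draw plots. In this sketch the hypothetical function age_span returns the difference between the largest and smallest age in a vector. In R, the value of the last expression in the function body is what the function returns.

```r
# compute the span between the smallest and largest value
age_span <- function(ages) {
  # range returns a vector with the minimum and the maximum
  limits <- range(ages)

  # the last expression is the value the function returns
  limits[2] - limits[1]
}

# call the function and store the result in a variable
span <- age_span(c(23, 37, 46, 50))
span
```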

Challenge:

  1. Add one more command to the analyze_neighborhood function to print out the range of arrest ages in the given neighborhood.

Loops (how to repeat yourself)

Actually, let’s see how many neighborhoods there are

# the unique function returns a vector of the values that appear in a given vector
neighborhoods <- unique(arrest_tab$neighborhood)
length(neighborhoods)
## [1] 268

Hmm, if we wanted to analyze every neighborhood we would need to write over 250 lines of code that look very similar. We can do better with loops. Let’s see how we can do this for the first 10 neighborhoods with a for loop:

# remember that we have got the neighborhood names already
for (neighborhood in neighborhoods[1:10]) {
  analyze_neighborhood(neighborhood)
}

R has a handy shorthand for this type of operation. The sapply function is used to apply some function to each entry in a vector. E.g., this is equivalent to the for loop we wrote just now:

sapply(neighborhoods[1:10], analyze_neighborhood)

Much cleaner.
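One difference worth knowing: a for loop only runs the commands, while sapply also collects whatever the function returns for each entry. A small sketch with a made-up vector of names and the built-in nchar function (which counts characters) shows this.

```r
# a made-up vector of neighborhood names
places <- c("Hampden", "Roland Park", "Canton")

# with a for loop we have to print each result ourselves
for (place in places) {
  print(nchar(place))
}

# sapply gathers the results into a single vector, named by entry
name_lengths <- sapply(places, nchar)
name_lengths
```

So when the function you apply returns something, sapply will print those returned values at the console unless you assign the result to a variable.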

Making decisions

Maybe we only want to do the analysis on neighborhoods that have more than 500 arrests on record. We can use if statements to make decisions. Let’s add an if statement to our analyze_neighborhood function:

analyze_neighborhood <- function(neighborhood) {
  # subset to arrests from given neighborhood
  # will only use age, sex and race columns
  arrest_subset <- arrest_tab[arrest_tab$neighborhood == neighborhood, c("age", "sex", "race")]
  
  # only make plots if there are more than 500 arrests
  if (nrow(arrest_subset) > 500) {
    # boxplot of age conditioned on sex
    boxplot(arrest_subset$age~arrest_subset$sex, xlab="Sex", ylab="Arrest Age", main=neighborhood)
    
    # boxplot of age conditioned on race
    boxplot(arrest_subset$age~arrest_subset$race, xlab="Race", ylab="Arrest Age", main=neighborhood)
    
    # barplot of number of arrests per race
    barplot(table(arrest_subset$race), xlab="Race", ylab="Number of Arrests", main=neighborhood)
  }
}

Now this function will only produce the plots for neighborhoods with more than 500 arrests. Let’s try it out:

for (neighborhood in neighborhoods) {
  analyze_neighborhood(neighborhood)
}
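Before running a loop like this, it can help to know how many neighborhoods will pass the condition. Here is a sketch of that check on a made-up vector, using table to count entries per neighborhood and a threshold of 2 instead of 500.

```r
# made-up neighborhood labels, one per arrest
hoods <- c("A", "A", "A", "B", "C", "C")

# count the number of arrests per neighborhood
arrest_counts <- table(hoods)
arrest_counts

# which neighborhoods have more than 2 arrests?
busy_hoods <- names(arrest_counts)[arrest_counts > 2]
busy_hoods

# and how many neighborhoods is that?
length(busy_hoods)
```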

Challenges:

  1. Add a min_arrests argument to the analyze_neighborhood function and use the value of that argument to decide if plots should be produced or not.
  2. Modify the analyze_neighborhood function to only produce plots for neighborhoods where more whites were arrested than blacks.

Writing plots to file

So far, we’ve been plotting everything to RStudio, but we may want to share our plots with collaborators. We can write these plots to, e.g., pdf:

pdf("neighborhood_plots.pdf", width=6, height=6)
for (neighborhood in neighborhoods[1:10]) {
  analyze_neighborhood(neighborhood)
}
dev.off()
## png 
##   2

The pdf function opens a pdf file for plotting. All plots we generate after that will be printed on that file until we call dev.off().
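Here is a self-contained sketch of the same pattern that does not need the arrest data. It writes two made-up plots to a temporary file (the name out_file is just for illustration) and then checks that the file was created.

```r
# a temporary file name ending in .pdf
out_file <- tempfile(fileext = ".pdf")

# open the pdf file for plotting
pdf(out_file, width = 6, height = 6)

# each plot we make now goes to a new page of the file
plot(1:10, main = "First page")
hist(c(23, 37, 46, 50, 33), main = "Second page")

# close the file; plots made after this go back to the screen
dev.off()

# check that the file exists
file.exists(out_file)
```

If you forget to call dev.off(), the file stays open and later plots keep going into it instead of showing up on screen.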

Summary

We have now seen some of the most important operations in coding data analyses: variables, indices, plotting, functions, loops and conditions. If you have time you can try some of these analyses:

  1. Find the most common offenses per neighborhood
  2. Make histograms of arrest ages
  3. Look at number of arrests per month, time of day.
  4. Find the most common offenses per time of day

Here is an example of an analysis of the geographical distribution of arrests:

Geographic distribution of arrests.

First we need to extract latitude and longitude from the Location.1 column. We’ll use some string functions to do this

library(stringr)
regex_match <- str_match(arrest_tab$Location.1, "\\((.*),(.*)\\)")
arrest_tab$lon <- as.numeric(regex_match[,3])
arrest_tab$lat <- as.numeric(regex_match[,2])
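To see what the pattern captures, here is a sketch of the same extraction on three sample strings using only functions that come with R (regexec and regmatches) instead of stringr. It assumes locations look like "(latitude, longitude)" as in the rows printed earlier; the empty string and the helper pick are made up for illustration.

```r
# two locations in the format seen in the data, plus one empty entry
locations <- c("(39.2814026274, -76.6483635135)",
               "",
               "(39.3117196723, -76.6623546313)")

# for each string: the full match, then the two pieces in parentheses
matches <- regmatches(locations, regexec("\\((.*),(.*)\\)", locations))

# pick piece i as a number, or NA if the pattern did not match
pick <- function(m, i) if (length(m) == 3) as.numeric(m[i]) else NA_real_

# the first captured piece is the latitude, the second the longitude
lat <- sapply(matches, pick, i = 2)
lon <- sapply(matches, pick, i = 3)
lat
lon
```

Entries that do not match the pattern end up as NA, which is a plausible reason for rows being dropped when plotting.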

Now let’s plot

library(maps)
library(ggplot2)

balto_map <- subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt <- ggplot(arrest_tab, aes(x=lon, y=lat)) +
        geom_polygon(data=balto_map, mapping=aes(x=long, y=lat), color="white", fill="gray40") +
        geom_point(color="blue", alpha=.1) +
        labs(title="Arrests in Baltimore", x="Longitude", y="Latitude")
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).